Large Language Models

Malo Jan & Luis Sattelmayer

2025-01-06

Climbing the ladder of abstraction in NLP

  • Bag of Words
    • text is represented as a collection of individual words without considering order or context
  • Word embeddings
    • dense, fixed-size vectors capturing word semantics
    • e.g. Word2Vec, GloVe, FastText
  • Today: Large Language Models (LLMs)
    • context-aware embeddings using attention mechanisms
    • e.g. BERT, GPT, Llama & Co

| Feature | Bag-of-Words | Static Embeddings | Transformers |
|---|---|---|---|
| Core Idea | Word count representation | Dense, fixed-size vectors capturing word semantics | Context-aware embeddings using attention mechanisms |
| Representation | Sparse vectors (e.g., one-hot encoding, counts) | Dense vectors (e.g., 300 dimensions) | Contextualized vectors generated dynamically |
| Semantics | None (words treated independently) | Semantic similarity, but context-independent | Semantic and context-aware understanding |
| Polysemy Handling | Cannot distinguish between meanings | Single vector for all senses of a word | Context-aware disambiguation |
| Methods | TF-IDF, word counts, scaling methods | Word2Vec, GloVe, FastText | LLMs: BERT, GPT, Llama & Co |
| Weaknesses | Ignores order and context, sparse | Context-blind, limited for nuanced tasks | Computationally expensive, requires large data |
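The bag-of-words column of the table can be made concrete in a few lines of plain Python. This is a toy sketch, not a production vectorizer (the `bag_of_words` helper is ours for illustration):

```python
from collections import Counter

def bag_of_words(docs):
    """Represent each document as a vector of word counts over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        vec = [0] * len(vocab)
        for w, n in Counter(d.lower().split()).items():
            vec[index[w]] = n          # only counts survive; order and context are lost
        vectors.append(vec)
    return vocab, vectors

docs = ["the cat sat", "the cat sat on the mat"]
vocab, X = bag_of_words(docs)
# vocab: ['cat', 'mat', 'on', 'sat', 'the']; X[1] counts "the" twice
```

Note the weakness listed above: "the cat sat" and "sat the cat" would receive identical vectors.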

Large Language Models

Neural Networks

  • A profile in the New Yorker about Geoffrey Hinton

Neural Networks

  • Computational model inspired by the way biological neural networks in the human brain process information
  • Designed to recognize patterns, make predictions, or classify data by mimicking the interconnected structure of neurons in the brain

A simple one-layered NN

Deep Learning

  • The terms deep learning and neural networks are often used interchangeably

  • Deep refers to the number of layers in the network (i.e. the depth)

  • Weights: values that define the strength of the connection between two neurons in adjacent layers of a neural network

  • Biases: parameters that allow the network to shift the activation function up or down

  • Activation Function: decides whether a neuron should “fire” (pass its signal forward) and how strong that signal should be

    • necessary to introduce non-linearity into the network and handle complex patterns in the data
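The terms above (weights, biases, activation function) can be tied together in a minimal single-layer forward pass. A numpy sketch with toy dimensions of our own choosing, not any particular library's API:

```python
import numpy as np

def relu(z):
    # activation function: the neuron "fires" only for positive input (non-linearity)
    return np.maximum(0.0, z)

def forward(x, W, b):
    """One layer: weighted sum of inputs, shifted by the bias, then activated."""
    return relu(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # 3 input features
W = rng.normal(size=(4, 3))      # weights: connection strengths, 3 inputs -> 4 neurons
b = np.zeros(4)                  # biases: shift each neuron's activation threshold
h = forward(x, W, b)             # activations of a hidden layer with 4 neurons
```

A "deep" network simply chains several such layers, feeding each layer's output `h` into the next.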

Transformer Architecture

  • Also a neural network architecture; state of the art in NLP since early 2018

  • “Attention is all you need” (Vaswani 2017)

  • Attention Mechanism:

    • allows the model to focus on different parts of the input sequence when predicting the next token
    • crucial for understanding the context of a word in a sentence
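The mechanism itself is short to write down. Below is a numpy sketch of single-head scaled dot-product attention as defined in Vaswani (2017), with random matrices standing in for the learned query/key/value projections of a real model:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each token's output is a weighted
    average of all value vectors, weighted by query-key similarity."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how much each token attends to each other token
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 4, 8                        # 4 tokens, 8-dimensional representations
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))
out, w = attention(Q, K, V)              # out: one context-aware vector per token
```

The rows of `w` are exactly the attention weights that visualizations such as Clark (2019) plot.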

Attention in Transformers

  • For a visualization see here
  • Viz below taken from Clark (2019)

Training a transformer

  • The model is given huge amounts of raw text and learns through self-supervised objectives such as causal language modeling and masked language modeling

Causally predicting the next word

Masked Language Modeling
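A toy illustration of the masking step in pure Python. Real BERT masks about 15% of tokens and additionally replaces some with random or unchanged tokens, a detail omitted here; the `mask_tokens` helper is our own:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """Hide a random subset of tokens; the training objective is to
    predict the hidden originals from the surrounding context."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok            # the label the model must recover
        else:
            masked.append(tok)
    return masked, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(sentence)
```

Because the target word can sit anywhere in the sentence, solving this objective forces the model to use context from both directions.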

Transfer learning & fine-tuning

  • LLMs are trained on large corpora and thus contain a lot of knowledge about the (written) world
  • This knowledge can be transferred to other similar and related tasks
  • Transfer learning adds a new neural network layer on top of the pre-trained model
  • Fine-tuning a pre-trained model on a specific task is often faster and requires less data than training a model from scratch
  • The model is fine-tuned on a smaller dataset to adapt to the specifics of the new task
  • For fine-tuning, the weights within the model are adjusted to the new task
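The division of labor described above (frozen pre-trained weights, new trainable layer on top) can be sketched in numpy. This is a deliberately simplified stand-in: a fixed random projection plays the role of the pre-trained encoder, and a logistic-regression head is the new task-specific layer; in practice one would fine-tune an actual transformer, e.g. via the Hugging Face ecosystem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained encoder (in reality: a transformer
# such as BERT); its weights are never updated below.
W_frozen = rng.normal(size=(16, 5))
def encode(X):
    return np.tanh(X @ W_frozen)

# Toy task data and a new, trainable classification head.
X = rng.normal(size=(200, 16))
y = (X[:, 0] > 0).astype(float)     # toy binary labels
H = encode(X)                       # fixed features from the frozen model
w, b = np.zeros(5), 0.0             # the only parameters we train

def loss(w, b):
    p = 1 / (1 + np.exp(-(H @ w + b)))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

loss_before = loss(w, b)
for _ in range(500):                # gradient descent on the head only
    p = 1 / (1 + np.exp(-(H @ w + b)))
    w -= 0.1 * H.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)
loss_after = loss(w, b)             # lower than loss_before: the head has adapted
```

Training only the small head is why transfer learning needs far less data and compute than training from scratch; full fine-tuning would additionally update the encoder's weights.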

Transfer learning

From pretrained models…

… to finetuning

Finetuning visualized (Do, Ollion, and Shen 2024)

Nvidia stock development

What is a GPU?

  • Graphics Processing Unit:
    • a specialized processor to render images and video
    • parallel processing capabilities
    • complements the CPU (Central Processing Unit), which excels at sequential tasks
    • optimized for matrix and vector operations; crucial for neural network computations
  • CUDA: Compute Unified Device Architecture
    • a parallel computing platform and application programming interface model created by NVIDIA
    • allows software developers to use a CUDA-enabled GPU for general purpose processing
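Why "matrix and vector operations" matter so much: the core workload of a neural network is matrix multiplication, and every output cell of a matrix product is independent of the others, which is exactly the kind of work a GPU's parallel hardware absorbs. A small sketch contrasting the sequential view with the vectorized one:

```python
import numpy as np

def matmul_loops(A, B):
    """What a purely sequential processor conceptually does:
    one multiply-accumulate at a time, in nested loops."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for l in range(k):
                C[i, j] += A[i, l] * B[l, j]
    return C

rng = np.random.default_rng(0)
A = rng.normal(size=(32, 64))
B = rng.normal(size=(64, 16))

# The vectorized form maps onto parallel hardware: every output cell's
# dot product can be computed concurrently. Same result, no Python loops.
C = A @ B
assert np.allclose(C, matmul_loops(A, B))
```

On a GPU, frameworks such as PyTorch dispatch this same `@` operation (at far larger sizes) to thousands of cores via CUDA.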

GPU for LLMs

  • Massive computation requirements
  • Memory bandwidth
  • Parallelization
  • A task that would take days with a CPU can be done in hours with a GPU
  • This does mean that working with LLMs…
    • … requires patience, time, and unfortunately computational resources

BERT

  • Bidirectional Encoder Representations from Transformers
    • masked language model for NLP tasks
    • only uses encoder part of the architecture
    • bi-directional attention, meaning it looks at the context from both the left and right sides of a word
  • RoBERTa: removes the next-sentence prediction (NSP) objective and is trained on larger datasets with more robust techniques
  • DistilBERT: Smaller and faster version of BERT
  • and other variants trained on different languages, tasks, etc

BERT vs. GPT

|  | BERT | GPT |
|---|---|---|
| Type | Encoder-only | Decoder-only |
| Training | Masked Language Modeling (MLM) | Causal Language Modeling (CLM) |
| Direction | Bi-directional | Uni-directional |
| Task Focus | Language understanding | Language generation |
| Strength | Deep sentence/context understanding | Coherent and fluent text generation |
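The Direction row of the table corresponds to different attention masks. A small numpy illustration (our own construction, not library code):

```python
import numpy as np

seq_len = 5

# BERT (encoder): bi-directional -- every token may attend to every
# other token, left and right context alike.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# GPT (decoder): causal -- token i may attend only to positions <= i,
# so the model cannot peek at the words it is supposed to predict.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# e.g. token 2 in the causal setting sees tokens 0..2 only
assert causal_mask[2].tolist() == [True, True, True, False, False]
```

In a real transformer this boolean mask is applied to the attention scores before the softmax, setting disallowed positions to negative infinity.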

Evolution of LLMs

Multilingual Models

Issues with LLMs

  • Carbon footprint
    • on quantifying ML’s carbon footprint see Lacoste et al. (2019)
  • LLMs hardly solve the problem of bias in AI
  • GPU dependent
  • Data privacy:
    • without personal GPU, you will have to think a lot about your data and whether it ought to be protected
    • some corpora (e.g. copyrighted data) should not be shared with third-party services (e.g. Google Colab)

Poli-sci examples

  • Measure populist frames in text: Bonikowski, Luo, and Stuhler (2022)
  • Anti-elite rhetoric: Licht et al. (2024)
  • Social groups: Licht and Sczepanski (2024)

What other use-cases for Transformers?

  • This leaves the realm of text
  • Image generation
    • DALL-E
    • Midjourney
    • Stable Diffusion
  • Speech-to-text
    • Whisper

Whisper

  • For the GitHub tutorial click here

Developed by OpenAI

Further resources

Bibliography

Bonikowski, Bart, Yuchen Luo, and Oscar Stuhler. 2022. “Politics as Usual? Measuring Populism, Nationalism, and Authoritarianism in US Presidential Campaigns (1952–2020) with Neural Language Models.” Sociological Methods & Research 51 (4): 1721–87.
Clark, Kevin. 2019. “What Does BERT Look At? An Analysis of BERT’s Attention.” arXiv Preprint arXiv:1906.04341.
Do, Salomé, Étienne Ollion, and Rubing Shen. 2024. “The Augmented Social Scientist: Using Sequential Transfer Learning to Annotate Millions of Texts with Human-Level Accuracy.” Sociological Methods & Research 53 (3): 1167–1200.
Lacoste, Alexandre, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. “Quantifying the Carbon Emissions of Machine Learning.” arXiv Preprint arXiv:1910.09700.
Laurer, Moritz, Wouter Van Atteveldt, Andreu Casas, and Kasper Welbers. 2024. “Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI.” Political Analysis 32 (1): 84–100.
Licht, Hauke, Tarik Abou-Chadi, Pablo Barberá, and Whitney Hua. 2024. “Measuring and Understanding Parties’ Anti-Elite Strategies.”
Licht, Hauke, and Ronja Sczepanski. 2024. “Who Are They Talking about? Detecting Mentions of Social Groups in Political Texts with Supervised Learning.” ECONtribute Discussion Paper.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.